Abstract

The likelihood principle makes strong claims about the nature of statistical evidence
but is controversial. Its claims are undermined by the existence of several
examples that are assumed to show that it allows, with unity probability, domination
of all other hypotheses by the uninteresting, determinist hypothesis that whatever
happened had to happen. Such examples are generally assumed to be important obstacles
to the application of the likelihood principle: they are counter-examples to the
principle. A re-analysis of Birnbaum’s 1969 “counter-example” demonstrates that
the standardly reported analyses of such examples involve an inappropriate treatment
of a nuisance parameter and that, when the nuisance parameter is adequately
considered, there is no conflict between the evidential consequences of the likelihood
principle and the intuitive evidential account of the problem. It also shows that the
conclusion that the likelihood principle allows the determinist hypothesis to dominate
with unity probability requires a misconception about the scope of the likelihood
principle or an inappropriately specified statistical model. Whatever happened did
not have to happen.

1 Introduction


The likelihood principle offers an approach to utilizing the information in the whole of
the likelihood function for inference and it appears to provide a principled basis for the
assessment of statistical evidence. However, the likelihood principle is controversial. It
might seem to be a natural ingredient of most Bayesian approaches, and indeed it is a
natural consequence of Bayes’ theorem and the product rule of probabilities (Jaynes 2003,
pp. 250-251), but Bayesians would usually not be comfortable with basing inference on
likelihoods rather than probabilities. Of course, a Bayesian justification is not very
compelling to frequentists who are effectively required to assume the likelihood principle is
false because of its conflict with an evidential basis for frequentist approaches. Thus the
likelihood principle has few friends.

The likelihood principle has been proved to be entailed by the widely accepted principles
of sufficiency and conditionality (Birnbaum 1962), but that proof has been challenged
(Mayo 2014) by a disproof that has itself been challenged in a different proof (Gandenberger
2014). The existence, or not, of a proof may be a side-issue because it is possible that
likelihood is a primitive, axiomatic postulate (Jeffreys 1938, Fisher 1938, Edwards 1992,
Birnbaum 1969). As Forster & Sober pointed out, “If likelihood is epistemologically
fundamental, then we should not be surprised to find that it cannot be justified in terms of
anything more fundamental”, and we could therefore expect only that the likelihood
principle “coincides with and systematizes our intuitions about examples” (Forster & Sober
2004, p. 155). If agreement with our intuitions about examples is all we can expect then
examples where the likelihood principle conflicts with our intuitions would stand as
potentially fatal counter-examples.


The counter-example that is the subject of this paper is due to Allan Birnbaum (1969),
who published the original proof of the likelihood principle in 1962. He must have
considered this example to be both important and persuasive, because he cites it in a letter
to Nature written as a response to a paper by A.W.F. Edwards promoting the use of
likelihood-based scientific inferences (Edwards 1969):


It is regrettable that Edwards’s interesting article, supporting the likelihood and prior
likelihood concepts, did not point out the specific criticisms of likelihood (and Bayesian)
concepts that seem to dissuade most theoretical and applied statisticians from adopting
them. (Birnbaum 1970, p. 1033)


Birnbaum also notes in that letter that he changed his mind about the likelihood between 
1962 and 1969 and regretted not having made that clear at an earlier date. Edwards replied 
to Birnbaum’s letter: 


Birnbaum has retracted his former support for the likelihood approach to scientific
inference after considering an example in which the approach would, it seems, always
lead to the wrong conclusion. (Edwards 1970)


Birnbaum was not the only one-time proponent of the likelihood principle to have a change
of heart as a consequence of apparent counter-examples. Ian Hacking, who first specified the
law of likelihood (Hacking 1965), also wrote of such a counter-example as being important
in the document where he notes that he had changed his mind concerning likelihood-based
inference. The document was an extensive review of Edwards’s 1972 book Likelihood, and
Hacking notes a variant of Birnbaum’s example and writes that “even on those increasingly
rare days when I will rank hypotheses in order of their likelihoods, I cannot take the actual
log-likelihood number as an objective measure of anything” (Hacking 1972, p. 137).


Interestingly, the issue of The British Journal for the Philosophy of Science containing
Hacking’s review of Edwards’s book also contains a (belated) review by G.A. Barnard of
Hacking’s own book of 1965, Logic of Statistical Inference. Barnard was another of the leading
proponents of likelihood-based inference, and his review notes the same problem as that
raised by Birnbaum’s example in the most general form:

The difficulty with Hacking’s account is that it leads him to say (p. 89): ‘An hypothesis
should be rejected if and only if there is some rival hypothesis much better supported than
it is.’ But there always is such a rival hypothesis, viz. that things just had to turn out
the way they actually did. (Barnard 1972, p. 129)


Given the seminal roles of Birnbaum, Hacking, Barnard and Edwards in the development
of our understanding of the likelihood principle and the law of likelihood, their concerns
about such counter-examples should be important to anyone with an interest in likelihood
and statistical evidence.

1.1 Structure 

The following sections of this paper consist of: (i) an introduction to likelihood functions
and the likelihood principle, along with an example that contrasts probability and
likelihood; (ii) a pair of simple examples which are similar to Birnbaum’s problem but do
not appear to throw up any conflict between intuition and the consequences of the
likelihood principle; (iii) Birnbaum’s example presented verbatim; (iv) a critique of Birnbaum’s
analysis and interpretation with two specific objections that invalidate the idea that
Birnbaum’s example is a counter-example to the likelihood principle; and (v) a discussion of the
‘whatever happened had to happen’ hypothesis which demonstrates that that determinist
hypothesis is not supported by a valid application of the likelihood principle.

2 Likelihood and the likelihood principle 

Likelihood functions and the likelihood principle are rarely included in basic statistical
textbooks and there is little consistency in form or naming of statements of the likelihood
principle in the literature. Thus a brief discussion of likelihood and a clear statement of
the likelihood principle are necessary to get this paper going. This section should be read by
even experienced users of likelihood functions because it emphasizes some aspects of the
likelihood principle that are usually tacit at best.

2.1 Likelihood 

Assume a statistical model that has a parameter, $\theta$, that can take any value in the set $\Theta$.
Such a model can be thought of as a meta-model consisting of a family of statistical models
each of which assigns a single value to $\theta$. Each member of that family provides a probability
distribution that gives the probability of any value, $x$, among the possible values, $X$, given
the particular value of $\theta$. The probability distributions have values of $x$ as the horizontal
axis, the ‘x-axis’, and integrate to unity (or sum to unity where $X$ is discrete).
Those probability distributions are the conventional probability distributions described and
displayed by most introductory statistics textbooks. They are not likelihood functions.

Now, assume also that we wish to make inferences about values of the meta-model
parameter $\theta$ and that we have observed a value, $x_{obs}$. The value $x_{obs}$ is special because
it was actually observed, and thus the only relevant value of $x \in X$ is $x = x_{obs}$. The
individual probability distributions of the family member models contain very little relevant
information because $x_{obs}$ is only a single point, but if we plot the probabilities associated
with $x_{obs}$ from all of the family member models together we get a function with an x-axis
consisting of values of $\theta$. Such a function treats the observation as fixed and the parameter $\theta$
as variable, in contrast to the probability distributions described in the previous paragraph
where $\theta$ is fixed and $x$ is the variable. The integral (or sum) of such a function can have
any positive real value, which contrasts with a probability distribution where the integral
(or sum) has to be unity. Such a function is a likelihood function.

Given the intimate relationship between probability distributions and likelihood
functions it might be assumed that they are much the same thing. But that would be wrong
because likelihoods are only defined (definable) to an arbitrary constant of proportionality,
and that arbitrary scaling means that “likelihood does not obey probability laws” (Pawitan
2001, p. 17).


R.A. Fisher was the first to define likelihood:

Likelihood. The likelihood that any parameter (or set of parameters¹) should have any
assigned value (or set of values) is proportional to the probability that if this were so, the
totality of the observations should be that observed. (Fisher 1922, p. 310)


Thus we can write:

$$L(\theta \mid x_{obs}) \propto P(x_{obs} \mid \theta), \quad \text{for all } \theta \text{ in } \Theta \tag{1}$$

$$L(\theta \mid x_{obs}) = c \cdot P(x_{obs} \mid \theta), \quad \text{for all } \theta \text{ in } \Theta \tag{2}$$

where $c$ is an arbitrary constant. Usually the “given $x_{obs}$” part is assumed and so the
likelihood $L(\theta \mid x_{obs})$ is written as just $L(\theta)$. It is not unusual to see equations like (1) but
with ‘$=$’ in place of the ‘$\propto$’ without the provision of an arbitrary constant, and that is
unfortunate because likelihoods and probabilities are not equivalent.

¹ Fisher uses the phrase “set of parameters” to mean a vector of parameters rather than a range of values
for one parameter, as he makes clear where he deals with problems of estimation in that same paper.


2.2 The likelihood principle 

In the paper where he sets out his alleged counter-example, Birnbaum describes two parts 
of the likelihood principle: 


Another general concept of statistical evidence, the likelihood concept (often called the
likelihood principle), consists of two parts. One is an axiom which resembles the sufficiency
axiom and indeed implies the latter. The second part specifies in more positive terms a
mode of interpretation of statistical evidence, and thus resembles the confidence concept
in some respects. (Birnbaum 1969, p. 125)


The axiom that Birnbaum refers to says that two proportional likelihood functions contain 
the same evidence, but it is often written as a statement that the likelihood function 
contains all of the evidence from data concerning the merits of various hypotheses.
Each form of the axiom logically entails the other and so the difference of expression need 
not concern us. The second part that Birnbaum refers to is usually called the law of 
likelihood, and it says that the degree to which the evidence favors one hypothesis over 
another is given by the ratio of the likelihoods of the hypotheses. It is noted that in many 
accounts the likelihood principle is taken to be only what Birnbaum calls the axiom, and 
the law of likelihood is dealt with separately. However, for the purposes of the current 
paper it is necessary to follow Birnbaum’s practice of including both parts in order that 
the critique of his alleged counter-example is fair. 

Birnbaum’s own definitions of the likelihood principle are included verbatim in an
appendix to this paper, but one is quite long and both are more technical than seems necessary.
Their key features are captured within this succinct statement from Edwards:

Within the framework of a statistical model, all of the information which the data provide
concerning the relative merits of two hypotheses is contained in the likelihood ratio of those
hypotheses on the data, and the likelihood ratio is to be interpreted as the degree to which
the data support one hypothesis against the other. (Edwards 1972, p. 31)


That statement has a couple of shortcomings as a definition, aside from its lack of
technical precision. First, the ‘hypotheses’ dealt with in the likelihood principle are nothing
more than values of the parameters of the statistical model, as Edwards demonstrates in
examples throughout his book, and as Birnbaum makes clear in both of his definitions
(see Appendix). That might not be thought to matter, as the use of ‘hypothesis’ as a
synonym for ‘parameter value’ is common practice in the discussion of likelihood functions
and the likelihood principle (I’ve done so above), but there is scope for serious confusion.
Not every hypothesis is a simple hypothesis corresponding to the value of a model
parameter and not every possible hypothesis can be accommodated by any particular statistical
model. Thus hypotheses will exist that cannot be compared using the likelihood principle.
It is also important to note that the likelihood principle does not provide a mechanism for
comparison of simple hypotheses that map onto a single value of a parameter with
composite hypotheses that specify ranges of parameter values. Birnbaum writes explicitly that
the likelihood principle “specifies no further structure or interpretation for the likelihood
ratio scale, nor any specific concept of ‘evidence supporting a set of parameter points.’ ”
(Birnbaum 1969, p. 126). Finally, the scope of the law of likelihood is naturally restricted
because the scaling of a likelihood function is always arbitrary and entails an unknown
proportionality constant. That unknown constant doesn’t matter if two likelihoods of interest
lie on the same likelihood function because the constant is the same for both and therefore
cancels in the likelihood ratio, but otherwise it leads to potentially meaningless likelihood
ratios. Thus the law of likelihood only tells us how to measure the relative support of
parameter values that lie on the same likelihood function. That restricted scope of the
law of likelihood is not always noted, but Edwards includes it obliquely in his definition by
writing “Within a statistical model”, and it is implied in Birnbaum’s definition where he
writes that the law of likelihood “consists of the statement that the evidence supporting
one parameter point $i$ against another $i'$ is represented just by the numerical value of the
likelihood ratio = $p_{ij}/p_{i'j}$” because $i$ and $i'$ are two parameter values on the same
likelihood function: “$p_{ij}$ of $i$, $i \in \Omega$” (Birnbaum 1962, p. 126).

With all of that as a preamble, we can briefly state the likelihood principle in a manner
that respects both Edwards’s and Birnbaum’s versions:

The likelihood principle: Two likelihood functions which are proportional to each other
over the whole of their parameter spaces have the same evidential content. Thus, within 
the framework of a statistical model, all of the information provided by observed data 
concerning the relative merits of the possible values of a parameter of interest is contained 
in the likelihood function of that parameter based on the observed data. The degree to 
which the data support one parameter value relative to another on the same likelihood 
function is given by the ratio of the likelihoods of those parameter values. 

2.2.1 Example to illustrate the likelihood principle 

Consider a simple but artificial example where the probabilities of rain during each day of
the current week are available via a seven day forecast from the local bureau of
meteorology. Assume that there is no intermediate state between rain and not rain so that those
states are mutually exclusive and exhaustive, and assume that the only time unit of
interest is ‘day’. With those assumptions we get a statistical model where the probability
distribution for each day consists of the forecast probability of rain and the complement of
that as the probability of not rain. Using that statistical model we have only two possible
observations, $x \in \{\text{rain}, \text{not rain}\}$, and days of the week are the only model parameter, so
$\theta \in \{\text{Monday}, \text{Tuesday}, \text{Wednesday}, \text{Thursday}, \text{Friday}, \text{Saturday}, \text{Sunday}\}$. Table 1 shows
the probabilities. Each column of probabilities displayed in that table is a probability
distribution that sums to unity, and the rows in the table are proportional to the likelihood
functions associated with the observations of rain and not rain.

Table 1: Probabilities of rain for the current week

              Monday  Tuesday  Wednesday  Thursday  Friday  Saturday  Sunday
P(rain)       0       0.07     0.65       0.2       0.05    0.01      0.01
P(not rain)   1       0.93     0.35       0.8       0.95    0.99      0.99
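
The contrast between the columns and the rows of table 1 can be checked numerically. The following is a minimal Python sketch rather than part of the original analysis; the variable names are illustrative assumptions and the numbers are those of table 1.

```python
# Forecast probabilities of rain from table 1, Monday through Sunday.
p_rain = [0.0, 0.07, 0.65, 0.2, 0.05, 0.01, 0.01]
p_not_rain = [1.0 - p for p in p_rain]

# Each column (one day) is a probability distribution over {rain, not rain}.
column_sums = [r + nr for r, nr in zip(p_rain, p_not_rain)]

# Each row (one observation) is proportional to a likelihood function over days,
# so nothing constrains its sum to be unity.
row_sum_rain = sum(p_rain)
row_sum_not_rain = sum(p_not_rain)

print([round(s, 10) for s in column_sums])
print(round(row_sum_rain, 10), round(row_sum_not_rain, 10))
```

With these numbers every column sums to unity, whereas the rain row sums to 0.99 and the not-rain row to 6.01, which is consistent with the rows being likelihood functions rather than probability distributions.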


There are two types of question that might be usefully addressed using the probabilities
and likelihoods in the table. First, should you pack an umbrella? The answer will depend on
your personal loss function (e.g. do you care about a bit of rain?), but it might be reasonable
to pack an umbrella on Wednesday and, maybe, Thursday. The answer to that umbrella
question is informed by individual probabilities: the 65% chance of rain on Wednesday
provides a reason to carry an umbrella without reference to any other probability in the
table. Such a decision treats the parameter value (Wednesday) as a given and the observable
(rain) as a variable. The second type of question that can be addressed using the table
goes the other way, from the observation to the parameter: what does an observation of
rain tell you about what day it is? Say you wake from unconsciousness and observe that
it is raining and wonder what day it is. You notice that table 1 is posted on the wall with
a prominent label “This week’s rain probabilities”. The observation of rain means that
only the upper row in table 1 is relevant, and within that row it is Wednesday that has
the highest probability of rain occurring. The observation of $x = \text{rain}$ provides evidence in
favor of the day being any day where there is a non-zero probability of rain, and it provides
stronger evidence for the day being a day where there is a high probability of rain. Thus
the observation $x = \text{rain}$ supports $\theta = \text{Wednesday}$ more strongly than it supports any other
value of $\theta$. Wednesday is the maximal likelihood estimate of $\theta$ given the observation of rain
and the probability model represented by table 1.

It is worth stating explicitly that while the first type of question—should you carry 
an umbrella?—is addressed using individual probabilities, the second type of question— 
what day is it?—requires a comparison of all of the likelihoods in the relevant row of the 
table. The relevance of that distinction can be seen by considering the effect of changing 
the probability of rain in the table. If, for example, the probability of rain on Friday 
was changed from 0.05 to 1, it would not affect whether you should carry the umbrella 
on Wednesday, but it would change the maximal likelihood estimate from Wednesday to 
Friday even though the evidence itself, x = rain, was not changed. The use of likelihoods to 
make inferences about parameter values on the basis of observations requires a comparison 
of the likelihoods of the various values of the parameters. 
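
The dependence of the estimate on the whole row, rather than on a single probability, can be sketched in Python. This is an illustrative sketch only: the function name `max_likelihood_day` is an assumption introduced here, and the change to Friday's probability is the hypothetical one described in the preceding paragraph.

```python
def max_likelihood_day(p_obs_by_day):
    """Return the day for which the probability of the observation is largest."""
    return max(p_obs_by_day, key=p_obs_by_day.get)

# Probabilities of rain from table 1.
p_rain = {"Monday": 0.0, "Tuesday": 0.07, "Wednesday": 0.65, "Thursday": 0.2,
          "Friday": 0.05, "Saturday": 0.01, "Sunday": 0.01}

mle_original = max_likelihood_day(p_rain)

# Hypothetical change: the probability of rain on Friday becomes 1.
p_rain_changed = dict(p_rain, Friday=1.0)
mle_changed = max_likelihood_day(p_rain_changed)

print(mle_original, mle_changed)  # Wednesday Friday
```

The probability of rain on Wednesday is the same in both dictionaries, so the umbrella decision for Wednesday is unaffected even though the estimate of the day moves.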

The likelihood principle tells us that, given the model where probabilities of rain come
from the forecast of the bureau of meteorology, all of the information from the observation of
$x = \text{rain}$ relevant to estimation of the day of the week is contained in the likelihood function
that is proportional to the probabilities in the top row of table 1, and that the evidential
favoring of that observation for each day relative to any other is proportional to the ratio of
those likelihoods. Thus the observation of rain stands as evidence in favor of the day being
Wednesday over the day being Thursday by the ratio of 0.65/0.2 = 3.25 and simultaneously
stands as evidence favoring Thursday over Saturday by the ratio 0.2/0.01 = 20.
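
Those ratios, and the fact that the arbitrary scaling constant cancels when both likelihoods lie on the same likelihood function, can be illustrated with a short Python sketch. The helper `likelihood_ratio` and the particular value of the constant `c` are assumptions made for illustration only.

```python
import math

def likelihood_ratio(likelihood, theta_a, theta_b):
    """Ratio of two likelihoods taken from the same likelihood function."""
    return likelihood[theta_a] / likelihood[theta_b]

# Probabilities of rain from table 1.
p_rain = {"Monday": 0.0, "Tuesday": 0.07, "Wednesday": 0.65, "Thursday": 0.2,
          "Friday": 0.05, "Saturday": 0.01, "Sunday": 0.01}

# Any positive constant gives an equally valid likelihood function;
# the value 7.3 is arbitrary and chosen only for illustration.
c = 7.3
likelihood = {day: c * p for day, p in p_rain.items()}

wed_vs_thu = likelihood_ratio(likelihood, "Wednesday", "Thursday")
thu_vs_sat = likelihood_ratio(likelihood, "Thursday", "Saturday")

print(round(wed_vs_thu, 2), round(thu_vs_sat, 2))
```

Changing `c` leaves both ratios unchanged (up to floating-point rounding), which is why the law of likelihood is restricted to comparisons within a single likelihood function.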

3 Examples analogous to Birnbaum’s problem 

Birnbaum’s counter-example provides a situation where, at least on the face of it,
application of the law of likelihood yields strong support for a hypothesis where intuitively it
seems that there should be none. The counter-example involves a single observation taken
from a population of unknown mean and the hypotheses in question concern the spread of
values in the population sampled: either it has a wide spread such that observations can
take values of up to 100 either side of the unknown mean; or it has no spread so that all
observations would take the value equal to the unknown mean. Birnbaum’s analysis led
him to say that, no matter what value is observed, the hypothesis that the population in
question has no spread will have a markedly higher likelihood than the hypothesis that the
population has a wide spread.

Birnbaum’s analysis is flawed, but the nature of the flaw is subtle and so it is helpful
to begin with analyses of analogous examples that are more accessible than Birnbaum’s in
some details. Two examples are supplied here, one very simple and another slightly more
complicated but directly analogous to Birnbaum’s. Both utilize a model of the kind favored by
statisticians: urns containing colored balls.

3.1 Example 1: Known color 

Imagine that an honest statistician places an urn on your desk and tells you that it is
either urn 1a which contains 10,000 red balls or it is urn 1b which contains 100 red balls
and 9,900 balls made up of 200 different colors, none of which is as numerous as the red
balls. He invites you to determine which urn you have on the basis of the colors of sampled
balls. You may randomly sample as many balls as you like as long as it is sampling with
replacement.

How many balls should you sample? The answer depends on what color(s) you observe
in your sample. If you draw a ball of any color other than red then even a single ball is definitive
evidence that the urn is 1b because only urn 1b contains non-red balls. If you draw a red
ball then that is evidence in favor of the urn being urn 1a because 1a contains a larger
proportion of red balls than urn 1b but, as urn 1b has some red balls, the evidence of
a single red ball would not be conclusive. Thus you might choose to draw more balls to
increase the strength of evidence in favor of urn 1a over 1b. The larger the number of red
balls observed without any other colors, the stronger that evidence would be, and the more
confident you could feel that the urn was 1a, but drawing even a single non-red ball
would confirm that the urn was 1b.

This is a simple problem that probably elicits the same intuitive response from all
readers, one that matches the description in the previous paragraph. It is also one that
matches the results of a formal likelihood analysis of the problem.

A formal likelihood analysis of the problem requires a statistical model which specifies
the probabilities of the possible observations, and for this problem those probabilities
obviously derive from the frequencies of the various colored balls within the urns. However
we do not know most of those frequencies and so we cannot construct a complete sampling
probability distribution based on individual colors. A convenient solution is available. As
all non-red colors are equally informative about the question at issue (the identity of the
urn), we can collapse the sample space into red and non-red to obtain a fully specified
sampling distribution. At first glance, the relevant hypotheses concern the urn on your
desk ($H_{1a}$: urn 1a; and $H_{1b}$: urn 1b) but the hypotheses in a likelihood analysis have
to be parameter values. Conveniently we can obtain such hypotheses by noting that $H_{1a}$
implies that there is only one color of ball in the urn, and $H_{1b}$ that there are 201 colors, so we
have $H'_{1a}$: $\nu_c = 1$ and $H'_{1b}$: $\nu_c = 201$, where $\nu_c$ is a parameter specifying how many
colors are in the urns. If $H'_{1a}$ is true then so is $H_{1a}$, and if $H'_{1b}$ is true then so is $H_{1b}$, so we
can logically equate the hypotheses concerning the identity of the urn with values of the
parameter $\nu_c$.

Table 2 shows the probabilities of observing red and non-red balls according to the
relevant model for this problem. Note that the probabilities sum to unity across the rows
because the two possible observations, $x = \text{red}$ and $x \neq \text{red}$, are exhaustive of the full
sample space, but the columns do not sum to unity because each row is a mutually
exclusive condition or hypothesis.

Table 2: Probabilities of drawing red and non-red balls for example 1

                          $x = \text{red}$   $x \neq \text{red}$
Urn 1a, $\nu_c = 1$       1                  0
Urn 1b, $\nu_c = 201$     0.01               0.99


The relevant likelihood functions are derived from the probabilities in table 2. As
the parameter values $\nu_c = 1$ and $\nu_c = 201$ are mutually exclusive and exhaustive, each
likelihood function consists of only two points. For a sample of more than one ball we only
need to keep track of the number of red and non-red balls because the sequence of the
observations is not informative as a consequence of the sampling being with replacement.
Thus, for a sample of $n = r + m$ balls consisting of $r$ red balls and $m$ non-red balls, the
likelihood function is this:

$$\begin{aligned}
L(H'_{1a}) &\propto P(x = \text{red} \mid H'_{1a})^r \times P(x \neq \text{red} \mid H'_{1a})^m \\
&\propto P(x = \text{red} \mid \nu_c = 1)^r \times P(x \neq \text{red} \mid \nu_c = 1)^m = 1^r \times 0^m \\
L(H'_{1b}) &\propto P(x = \text{red} \mid H'_{1b})^r \times P(x \neq \text{red} \mid H'_{1b})^m \\
&\propto P(x = \text{red} \mid \nu_c = 201)^r \times P(x \neq \text{red} \mid \nu_c = 201)^m = 0.01^r \times 0.99^m
\end{aligned} \tag{3}$$

$L(H'_{1a}) = 1$ where $m = 0$ and is zero otherwise². The likelihood ratio for any observed
mixture of balls is just the ratio of those two values.

$$\frac{L(H'_{1a})}{L(H'_{1b})} = \frac{1^r \times 0^m}{0.01^r \times 0.99^m} \tag{4}$$

² Taking $0^0 = 1$.

Birnbaum’s problem involves only a single observation, so it is worth calculating the
likelihood ratio for this example under the two possible $n = 1$ outcomes. If a red ball is
drawn, $r = 1$ and $m = 0$, the likelihood ratio is 100:1 in favor of $H'_{1a}$, and if a non-red ball
is drawn, $r = 0$ and $m = 1$, the likelihood ratio is 0, which is conclusive evidence in favor
of $H'_{1b}$ over $H'_{1a}$. That likelihood analysis matches the intuitive responses to the possible
results in this example, as described above.
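
The likelihood ratio of equation (4) is simple enough to evaluate directly. The Python sketch below is offered only as an illustration; the function names are assumptions introduced here, and it relies on Python evaluating `0.0 ** 0` as 1, which matches the convention $0^0 = 1$ used in the text.

```python
import math

def likelihood_urn_1a(r, m):
    """Likelihood of nu_c = 1 (urn 1a) for r red and m non-red balls."""
    return (1.0 ** r) * (0.0 ** m)  # 0.0 ** 0 evaluates to 1.0 in Python

def likelihood_urn_1b(r, m):
    """Likelihood of nu_c = 201 (urn 1b) for r red and m non-red balls."""
    return (0.01 ** r) * (0.99 ** m)

def likelihood_ratio_1a_vs_1b(r, m):
    """Likelihood ratio of equation (4)."""
    return likelihood_urn_1a(r, m) / likelihood_urn_1b(r, m)

single_red = likelihood_ratio_1a_vs_1b(r=1, m=0)      # about 100, favoring urn 1a
single_non_red = likelihood_ratio_1a_vs_1b(r=0, m=1)  # 0, conclusive for urn 1b
three_red = likelihood_ratio_1a_vs_1b(r=3, m=0)       # about 100 ** 3

print(round(single_red, 6), single_non_red, round(three_red, 2))
```

Each additional red ball multiplies the ratio by 100, while a single non-red ball sends it to zero regardless of how many red balls were seen before.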

3.2 Example 2: Unknown color 

The second example is identical to the first except that the color shared by the two urns is 
not known. Thus, imagine that an honest statistician places an urn on your desk and tells 
you that it is either urn 2a which contains 10,000 balls of one color, or it is urn 2b which 
contains 100 balls of that same color and 9,900 balls made up of 200 different colors, none 
of which is as numerous as the balls with the shared but unknown color. He invites you to 
determine which urn you have on the basis of observed colors of sampled balls. You may 
sample as many balls as you like as long as it is sampling with replacement. 

How many balls should you sample? As in example 1, the answer depends on what 
color(s) are observed. This is more complex than the previous example, but an intuitive 
interpretation of possible outcomes is still fairly clear. This time there is no possible n = 1 
observation that could stand as evidence for one urn over the other because any single color 
might be the color that is present in both urns, and a sample of one ball will certainly have 
a single color. Thus the sample needs more than one ball for it to provide evidence that 
discriminates between the urns. If all of the balls in a sample of more than one have the 
same color, that would stand as evidence in favor of the urn being 2a. However, as soon 
as the sample contains multiple colors the evidence conclusively favors urn 2b. 

As in the previous example, the statistical model has probabilities determined by the
frequencies of the ball colors in the urns, but in this example we do not know any of those
frequencies. The probabilities can only be had by making an assumption about the color
that is shared by the urns, a color that will be denoted by $\mu$. For example, if we assume
that $\mu = \text{blue}$, the probabilities are those in table 3. That table is larger than the table
in example 1 because the model for this example has two unknown parameters, $\mu$ and $\nu_c$,
where the model in example 1 only has $\nu_c$. The left and right halves of table 3 are mutually
exclusive alternatives, so each row in the table contains two probability distributions and
sums to 2. Unfortunately, the probabilities in table 3 do not yield a usable likelihood
function, even if an observed color is substituted for ‘blue’ in the table, as we would not
know which half of the table is relevant.


Table 3: Probabilities of drawing blue and non-blue balls for example 2

                          $\mu = \text{blue}$                          $\mu \neq \text{blue}$
                          $x = \text{blue}$   $x \neq \text{blue}$     $x = \text{blue}$   $x \neq \text{blue}$
Urn 2a, $\nu_c = 1$       1                   0                        0                   1
Urn 2b, $\nu_c = 201$     0.01                0.99                     < 0.01              > 0.99


A likelihood function relevant to this problem can be obtained, but its derivation is less
obvious than in the previous example. The difficulty is that even though we don’t really
care about the value of $\mu$ as it is not relevant to the question about the identity of the urn,
without knowledge of $\mu$ we cannot interpret the observation of any single color. However,
as the question of interest does not concern the color of balls, a likelihood function based
on the number of colors in the sample rather than the nature of the color(s) can serve our
purpose. Conveniently, that function is independent of the nuisance parameter $\mu$.

Let $n_c$ be the number of distinct colors in a sample of $n$ balls. Table 4 presents the
probabilities for observations of $n_c = 1$ and $n_c > 1$. The value $(\le 0.01)^{n-1}$ in that table
requires a little explanation. The probability of observing $x = \mu$ when a ball is drawn
from urn 2b is 100/10,000 = 0.01, and because the other colors are less numerous (for
consistency with Birnbaum’s example, below), the probability of observing any other color
is less than 0.01. The exponent is $n - 1$ because even though any observation would satisfy
$n_c = 1$ when $n = 1$, repeated draws of a single color are needed for $n_c = 1$ with $n = 2$ or
more.

Table 4: Probabilities of a sample of n balls containing n_c distinct colors in example 2

                          $n_c = 1$              $n_c > 1$
Urn 2a, $\nu_c = 1$       1                      0
Urn 2b, $\nu_c = 201$     $(\le 0.01)^{n-1}$     $1 - (\le 0.01)^{n-1}$


We use the same logic as in the previous example to equate the hypotheses about 
the identity of the urns, $H_{2a}$ and $H_{2b}$, with hypotheses $H'_{2a}$ and $H'_{2b}$ which relate to the 
parameter $\nu_c$. For an observed $n_c = 1$ from a sample of $n$ balls we get the likelihood 
function 


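$$
\begin{aligned}
L(H'_{2a}) &\propto P(n_c = 1 \mid \nu_c = 1) = 1 \\
L(H'_{2b}) &\propto P(n_c = 1 \mid \nu_c = 201) = (\le 0.01)^{n-1},
\end{aligned}
\tag{5}
$$

and for $n_c > 1$ the likelihood function is 

$$
\begin{aligned}
L(H'_{2a}) &\propto P(n_c > 1 \mid \nu_c = 1) = 0 \\
L(H'_{2b}) &\propto P(n_c > 1 \mid \nu_c = 201) = 1 - (\le 0.01)^{n-1}.
\end{aligned}
\tag{6}
$$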


The likelihood ratio also depends on whether the observation is $n_c = 1$ or $n_c > 1$. For 
$n_c = 1$ it is 

$$
\frac{L(H'_{2a})}{L(H'_{2b})} = \frac{1}{(\le 0.01)^{n-1}}, \tag{7}
$$

and for $n_c > 1$ it is 

$$
\frac{L(H'_{2a})}{L(H'_{2b})} = \frac{0}{1 - (\le 0.01)^{n-1}}. \tag{8}
$$

Again we should consider the possible results when only a single ball is drawn for 
consistency with Birnbaum’s problem. If $n = 1$ then the test statistic will necessarily be 
$n_c = 1$, so the relevant likelihood ratio is equation 7. Where $n = 1$, $n - 1 = 0$, so the 
likelihood ratio will be 1:1. That result means that no matter what color is observed with 
a single draw the evidence is neutral with respect to the hypotheses concerning the identity 
of the urn. That result matches an intuitive response to the example. 
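
The dependence of that likelihood ratio on the sample size can be checked numerically. The following is a minimal sketch rather than part of the original analysis: the function name `lr_single_color` and its parameter `p_repeat` are hypothetical, and the code uses the upper bound 0.01 for the probability that a draw from urn 2b repeats the first color, so the value returned is a lower bound on the ratio in favor of urn 2a. 

```python
def lr_single_color(n, p_repeat=0.01):
    """Lower bound on L(H'_2a) / L(H'_2b) when all n draws share one color.

    p_repeat is the largest probability that a draw from urn 2b matches the
    color of the first ball (0.01 in example 2). Smaller values would only
    increase the ratio.
    """
    if n < 1:
        raise ValueError("need at least one draw")
    likelihood_2a = 1.0                   # urn 2a holds a single color
    likelihood_2b = p_repeat ** (n - 1)   # first draw is free, then n - 1 repeats
    return likelihood_2a / likelihood_2b


# A single draw is neutral; each additional matching draw multiplies
# the ratio by at least 100.
for n in (1, 2, 3):
    print(n, lr_single_color(n))
```

With `n = 1` the exponent is zero and the ratio is exactly 1, which is the neutral result described above. 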

4 Birnbaum’s counter-example 


Birnbaum’s example is similar to example 2 above, but in the space of numbers rather than 
colors. It consists of a single observation, $x$, drawn from a population with unknown mean 
$\mu$ and a spread parameter, $\sigma$, that is either 0 or 100. Birnbaum writes³: 

Let $\theta = (\mu, \sigma)$: The values $\sigma = 0$ or 100, respectively, represent the unknown precision, 
either very high or very low, of a single observation (measurement) $x$; and an integer, 
$-10^{10} < \mu < 10^{10}$, is the unknown true value of a quantity to be measured. Let 

$$
f(x, \theta) = f(x, \mu, \sigma) =
\begin{cases}
1, & \text{if } \sigma = 0 \text{ and } x = \mu \\
0, & \text{if } \sigma = 0 \text{ and } x \neq \mu, \text{ for all } x, \mu; \\
c\,[100 - |x - \mu|] & \text{if } \sigma = 100, \text{ for } |x - \mu| < 100, \\
0 & \text{otherwise.}
\end{cases}
$$

Thus $\sigma = 0$ gives an error-free measurement $x = \mu$; $\sigma = 100$ gives a wide “triangular” 
distribution of $x$ centered at $\mu$. The constant $c = 1/10{,}000 = 10^{-4}$ (to give total probability 
1). The likelihood function determined by any of the possible observations x, \x\ < 10 s , is 


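$$
\begin{aligned}
f(x, \mu, \sigma) &= 1, && \text{for } \sigma = 0 \text{ and } \mu = x; \\
&= 100c = 10^{-2}, && \text{for } \sigma = 100 \text{ and } \mu = x; \\
&< 100c = 10^{-2}, && \text{for } \sigma = 100 \text{ and } \mu \neq x.
\end{aligned}
$$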


A simple application of the likelihood principle, in accord with the account of the preceding 
section⁴, is the following: The parameter point $\theta = (\mu, \sigma) = (x, 0)$, that is, $\mu = x$ and 
$\sigma = 0$, has uniquely maximum likelihood. (Its likelihood ratio, as against each other point 
in turn, exceeds unity.) Thus it seems to be supported by the observed result, as against 
each and all other parameter points. This includes, in particular, that the value $\sigma = 0$ 
seems to be supported against the alternative $\sigma = 100$. Evidently the numerical values of 
the likelihood ratios referred to, all exceeding 100, represent strong evidence in some sense. 


3 There are several obvious typographical errors in the original that have been corrected here to avoid 
confusion. None of the errors changed the sense of the text or the formulae. 

4 [Reproduced in the appendix to this paper.] 
Evidently such interpretations of statistical evidence must be regarded as misleading, and 
strongly misleading, in case the true value of $\sigma$ is not 0 but 100. But such misleading 
interpretations will be suggested by the likelihood principle with probability unity, if for 
example $\sigma = 100$ and $\mu = 0$ are the true parameter values, since in this case each possible 
outcome $x$ determines a likelihood function of the form considered. (Birnbaum 1969, 
pp. 127-128) 


Birnbaum’s analysis leads to the conclusion that no matter what single value of $x$ is 
observed, the likelihood associated with the hypothesis that $\sigma = 0$ will always be at least 
100 times larger than that associated with the hypothesis $\sigma = 100$, even if the latter 
hypothesis is true. It is generally assumed that the example is therefore a counter-example 
to the likelihood principle. 
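
Birnbaum’s conclusion can be reproduced by direct enumeration. The following is a minimal sketch under the definitions quoted above, using $c = 10^{-4}$; the names `C` and `f` are chosen here for illustration and exact fractions are used so that no rounding is involved. 

```python
from fractions import Fraction

C = Fraction(1, 10_000)  # normalizing constant c in Birnbaum's definition


def f(x, mu, sigma):
    """Probability of observing the integer x given theta = (mu, sigma)."""
    if sigma == 0:
        return Fraction(1) if x == mu else Fraction(0)
    if sigma == 100:
        d = abs(x - mu)
        return C * (100 - d) if d < 100 else Fraction(0)
    raise ValueError("sigma must be 0 or 100")


# The triangular distribution (sigma = 100, mu = 0) sums to one.
total = sum(f(x, 0, 100) for x in range(-99, 100))

# For every x that can occur when sigma = 100 and mu = 0, compare the
# point (x, 0) with every point (mu, 100) that gives x non-zero probability.
smallest_ratio = min(
    f(x, x, 0) / f(x, mu, 100)
    for x in range(-99, 100)
    for mu in range(x - 99, x + 100)
)
print(total, smallest_ratio)
```

The smallest ratio occurs at $\mu = x$, which is the comparison between $\theta = (x, 0)$ and $\theta = (x, 100)$ that Birnbaum highlights. 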


5 Critique 

Birnbaum’s alleged counter-example seems to be closely analogous to example 2 presented 
in section 3.2 in that they both present a choice between two populations which have a 
fixed, but unknown, parameter in common and a parameter representing distributional 
spread that takes one of two specified values. However, despite the structural similarities, 
example 2 and the alleged counter-example yield quite different outcomes: one seems to 
support the likelihood principle and the other is generally taken to represent an important 
counter-example to it. At least one of three things must be true: 
the examples are disanalogous in some important way; one of the analyses is flawed; or the 
counter-example is a counter-example to the particular analysis used by Birnbaum rather 
than to the likelihood principle in general. Those possibilities are discussed in the sections 
below. 

5.1 The examples are disanalogous 

There are several reasons to conclude that there is a close analogy between example 2 and 
Birnbaum’s example. 
1. The desired inferences in the examples are equivalent in that both offer a choice 
between two hypothetical populations. 

2. One population in each example consists of a single class of objects whereas the other 
has more than one. 

3. Both examples use a discrete scale of measurement so that the colors of the balls can 
be mapped one to one onto integer values in Birnbaum’s example. 

4. The parameters of the statistical models can be mapped one to one onto each other: 
$\mu$ is a value shared by the two possible populations in both; and Birnbaum’s spread 
parameter, $\sigma$, has its equivalent in example 2, the number of colors of ball in the urns, 
$\nu_c$, with the simple linear relationship $(\nu_c - 1)/2 = \sigma$. Thus the parameter space for 
both examples is equivalent and can be denoted in the same manner: $\theta = (\mu, \sigma)$. 

5. In both Birnbaum’s example and example 2 a second observation would yield either 
definitive evidence that $\sigma = 100$ if the two observations were dissimilar, or strong 
evidence with a likelihood ratio of at least $10^4$ in favor of $\sigma = 0$ if the observations 
were identical. 

Those considerations seem sufficient to conclude that example 2 is a close analogue of 
Birnbaum’s example. 

5.2 One of the analyses is flawed 

If Birnbaum’s example and example 2 are closely analogous then the divergent results must 
come from differences in the analyses used. That can be confirmed by simply applying the 
analysis used in example 2 to Birnbaum’s problem. Therefore, assume that we observe 
$x = 17$ so that the full set of probabilities are those in table 5. That table can be obtained 
by just substituting into the example 2 table ‘17’ for ‘blue’ and approximating $100c$ with 
0.01, and, for a likelihood analysis, it presents exactly the same quandary as it did in 
example 2: which half of the table is relevant? Using the number of distinct values in the 
sample, $n_v$, as a test statistic side-steps that difficulty and gives the probabilities in table 6. 
Those probabilities are the same as those in table 4, so the likelihood functions are also the 
same as those in example 2. That demonstrates that application of the example 2 analysis 
to Birnbaum’s problem yields a result where the likelihood-based interpretation of the 
evidential meaning of the observation matches the intuitive response. It can be concluded 
that the essence of the counter-example lies in the analysis rather than the logical structure 
of the problem. 

Table 5: Probabilities of observing $x = 17$ and $x \neq 17$ for Birnbaum’s example (using 
$100c = 0.01$). 

                        $\mu = 17$                      $\mu \neq 17$
                    $x = 17$   $x \neq 17$         $x = 17$   $x \neq 17$
$\sigma = 0$          1          0                   0          1
$\sigma = 100$        0.01       0.99                < 0.01     > 0.99


Table 6: Probabilities of a sample of $n$ observations containing $n_v$ distinct values in 
Birnbaum’s example (using $100c = 0.01$). 

                    $n_v = 1$               $n_v > 1$
$\sigma = 0$          1                       0
$\sigma = 100$        $(\le 0.01)^{n-1}$      $1 - (\le 0.01)^{n-1}$
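
The bound in the $\sigma = 100$ row of that table can be checked by summing over the support of the triangular distribution. This is a minimal sketch and not part of the original analysis; the name `p_all_equal` is hypothetical, the observations are assumed to be independent draws, and $c = 10^{-4}$ is taken from Birnbaum’s definition. 

```python
from fractions import Fraction

C = Fraction(1, 10_000)  # normalizing constant c in Birnbaum's definition


def p_all_equal(n):
    """P(n_v = 1) for n independent observations when sigma = 100.

    The sum runs over the offset d = x - mu, so the result does not
    depend on the value of mu.
    """
    return sum((C * (100 - abs(d))) ** n for d in range(-99, 100))


# Check the tabulated bound P(n_v = 1) <= 0.01 ** (n - 1) for small samples.
for n in (1, 2, 3, 4):
    bound = Fraction(1, 100) ** (n - 1)
    assert p_all_equal(n) <= bound
```

Because the sum is taken over the offset from $\mu$, the check also illustrates that the statistic $n_v$ has the same distribution whatever the value of the nuisance parameter. 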


5.3 Objections to Birnbaum’s analysis 

The success of the example 2 analytical formulation in dealing with Birnbaum’s example 
does not necessarily mean that Birnbaum’s analytical formulation is flawed, but two 
interrelated objections can be raised. First, Birnbaum’s analysis leads to the inference being 
determined by an inferentially irrelevant parameter and, second, the statistical model used 
in that analysis has two parameters, $\mu$ and $\sigma$, whereas the problem supplies only a single 
observation. 

5.3.1 Objection 1: outcome determined by inferentially irrelevant parameter 

Birnbaum’s presentation of the parameter of interest is somewhat contradictory. In the 
description of the problem he writes that $\mu$ is “the unknown true value of a quantity to be 
measured”, but that only means that $\mu$ is the parameter of interest to the agent making 
the measurement. That agent does not face any monstrous dilemma in determining the 
best supported value of that parameter because there is nothing misleading in the support 
of $\mu = x$ over alternatives that the likelihoods provide. The allegedly misleading evidence 
concerns not $\mu$, but $\sigma$, as Birnbaum makes clear: 

This includes, in particular, that the value $\sigma = 0$ seems to be supported against the 
alternative $\sigma = 100$. [...] Evidently such interpretations of statistical evidence must be 
regarded as misleading, and strongly misleading, in case the true value of $\sigma$ is not 0 but 
100. But such misleading interpretations will be suggested by the likelihood principle with 
probability unity, if for example $\sigma = 100$ and $\mu = 0$ are the true parameter values, since in 
this case each possible outcome $x$ determines a likelihood function of the form considered. 


As the alleged counter-example rests entirely on the behavior of the analysis with regard 
to $\sigma$, the parameter of interest to us must be $\sigma$. In that case the value of $\mu$ is incidental to 
the problem: in so far as it plays a role in the analysis it is as a nuisance parameter. An 
appropriately designed analysis would ensure that estimation of $\sigma$ is not influenced by that 
nuisance parameter. Section 5.2 contains such an analysis, and it provides a likelihood 
function supporting conclusions that mirror our intuitive response to evidence, in stark 
contrast to those that are drawn from Birnbaum’s function. 

We have two different likelihood functions that are, apparently, appropriate for the same 
experiment, and they yield contrasting pictures of the evidence. The question therefore 
arises as to which of those functions is correct. They are both correct in the sense of 
being valid likelihood functions for the experiment, despite Birnbaum’s being incomplete⁵. 
However, each function has a different inferential scope. Birnbaum presents the likelihood 
function for the vector parameter $\theta = (\mu, \sigma)$ and it provides likelihood ratios that quantify 
the relative merits of the various values of $\theta$; in particular it shows that any observation of $x$ 


5 Birnbaum’s likelihood function lacks a likelihood for $\theta = (\mu \neq x, \sigma = 0)$. As the likelihood of that 
parameter value is zero, its omission does not much matter. 
supports $\theta = (x, 0)$ over any other value of $\theta$. It may seem natural to assume that the best 
supported value of $\sigma$ is therefore 0, but that assumption would be wrong. Simultaneous 
assessment of the evidence regarding $\mu$ and $\sigma$ as components of a vector need not yield 
the same result as the assessment of the evidence regarding either of them as independent 
scalars. In the context of Birnbaum’s problem, the $\mu$ dimension of $\theta$ affects the probability 
of observing $x$ more sharply than does the $\sigma$ dimension and so the relationship between $x$ 
and $\mu$ plays a dominant role in shaping the evidence regarding values of $\theta$. Of course, we 
are interested in $\sigma$ rather than $\mu$ or $\theta$, so a dominant influence of $\mu$ over $\sigma$ in the analysis 
is disastrous. We need to deal with $\mu$ as a nuisance parameter. In sections 3.2 and 5.2 
$\mu$ is eliminated by the choice of parameterization, but other strategies could be 
equally effective; in particular, a marginal likelihood function for $\sigma$ obtained by integrating 
out $\mu$ would yield the same result of equal support for $\sigma = 0$ and $\sigma = 100$. Thus there is 
no contradiction in the data providing equal support for $\sigma = 0$ and $\sigma = 100$ at the same 
time as it shows strong support for $\theta = (x, 0)$ over $\theta = (x, 100)$, but where the parameter 
of interest is $\sigma$ the latter result is not relevant. 
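
The marginal likelihood mentioned above can be illustrated with a short computation. This is a minimal sketch under assumptions that are not in the original text: $\mu$ is given equal weight over a finite range of integers (here $-1000$ to $1000$, far narrower than Birnbaum’s range, purely to keep the sum small), and the names `f` and `marginal_likelihood` are hypothetical. 

```python
from fractions import Fraction

C = Fraction(1, 10_000)  # normalizing constant c in Birnbaum's definition


def f(x, mu, sigma):
    """Probability of observing the integer x given theta = (mu, sigma)."""
    if sigma == 0:
        return Fraction(1) if x == mu else Fraction(0)
    d = abs(x - mu)
    return C * (100 - d) if d < 100 else Fraction(0)


def marginal_likelihood(x, sigma, mu_values):
    """Average f(x | mu, sigma) over equally weighted values of mu.

    The equal weighting is an assumption made for this illustration.
    """
    weight = Fraction(1, len(mu_values))
    return sum(weight * f(x, mu, sigma) for mu in mu_values)


mu_values = range(-1000, 1001)  # assumed range, wide enough to cover x +/- 99
x = 17
m0 = marginal_likelihood(x, 0, mu_values)
m100 = marginal_likelihood(x, 100, mu_values)
print(m0 / m100)
```

Under those assumptions the two marginal likelihoods are equal for any observed $x$ that lies at least 100 units inside the range of $\mu$, so the ratio is 1. 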

If Birnbaum’s example is a counter-example, it is a weak counter-example in that the 
misleading outcome is very easily avoided by appropriate treatment of a nuisance 
parameter. 

5.3.2 Objection 2: more parameters than data 

Birnbaum’s model has more parameters than the number of available data points, or, given 
the vector notation used in places for the parameter $\theta$, the model has a parameter with 
more dimensions than the number of data points. Either way it represents a condition that 
is often called ‘overfitting’. The pivotal role of overfitting in Birnbaum’s example is readily 
seen by simply considering two variants of his example that lack overfitting: (i) where there 
is only one parameter; and (ii) where there is a second datum. 

(i) Single parameter variant. Say that it is known that the true value to be measured is 
$\mu = 17$ and we wish to determine whether $\sigma = 0$ or $\sigma = 100$ in order to know about 
the measurement precision. The model has only one parameter to be estimated from 
the data and the relevant probabilities are those in the left half of table 5. This 
single parameter variant of Birnbaum’s problem has exactly the structure of the balls 
problem presented in example 1, and the intuitive response to the evidence presented 
by any possible observation matches exactly the relevant likelihood function. Thus 
the alleged counter-example does not survive modification into a single parameter 
variant. 

(ii) Two observation variant. Say that neither part of $\theta = (\mu, \sigma)$ is known and two 
values are observed, $x_1$ and $x_2$. It should be intuitively clear that if $x_1 = x_2$ then 
there would be strong support for the hypothesis that $\theta = (\mu = x_1 = x_2, \sigma = 0)$ 
and, similarly, if $x_1$ and $x_2$ differ then they would provide proof that $\sigma \neq 0$ and 
thus support $\theta = ((x_1 + x_2)/2, \sigma = 100)$ over any other value of $\theta$. The type of analysis 
used in example 2, where the number of distinct observations, $n_c$, is used as a test 
statistic, would be applicable to this problem, but an analysis using the observed 
values directly is equally applicable and, as it is more similar to the original analysis 
by Birnbaum, it will be presented, first for the case where $x_1 = x_2 = 17$ and then 
where $x_1 = 17$, $x_2 = 132$. 

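For an observation of $x_1 = x_2 = 17$, the likelihood function consists of two lines on $\mu$: 

$$
L(\mu, \sigma = 0) \propto
\begin{cases}
1, & \text{where } |17 - \mu| = 0 \\
0, & \text{otherwise}
\end{cases}
\tag{9}
$$

$$
L(\mu, \sigma = 100) \propto
\begin{cases}
\left(c\,(100 - |17 - \mu|)\right)^2, & \text{where } |17 - \mu| < 100 \\
0, & \text{otherwise.}
\end{cases}
\tag{10}
$$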


That likelihood function supports the parameter $\theta = (17, 0)$ over all others by a 
margin of at least $10^4$, but, given that an observation of $x_1 = x_2$ would occur with a 
frequency of less than $10^{-4}$ if $\sigma = 100$, that support would very rarely be misleading. 

For an observation of $x_1 = 17$, $x_2 = 132$ the likelihood function is also two lines on $\mu$, 
but one of the lines, where $\sigma = 0$, is a flat line at zero and the other, where $\sigma = 100$, 
consists of a line mostly at zero but with a central concave-down parabola in the 
region between $\mu = 33$ and $\mu = 117$: 

$$
L(\mu, \sigma = 0) \propto 0 \quad \text{for all } \mu
\tag{11}
$$

$$
L(\mu, \sigma = 100) \propto
\begin{cases}
c^2\,(100 - |17 - \mu|)(100 - |132 - \mu|), & \text{where } |17 - \mu| < 100 \text{ and } |132 - \mu| < 100 \\
0, & \text{otherwise.}
\end{cases}
\tag{12}
$$

That likelihood function displays strongest support for the parameters $\theta = ((x_1 + 
x_2)/2, 100)$, and that support would never be misleading concerning the $\sigma$ dimension 
of $\theta$. 


Thus it is demonstrated that the likelihood analysis of the problem when there are two 
data points results in strong support for $\theta$ containing $\sigma = 0$ when the two observations 
are identical and $\theta$ containing $\sigma = 100$ when they differ, in both cases matching 
the intuitive response to the problem. The probability of obtaining strong 
misleading support for $\sigma = 0$ when $\sigma$ is actually 100 is less than $10^{-4}$. The alleged 
counter-example does not survive the inclusion of a second observation. 
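
The two-observation analysis can be reproduced by a grid search over the parameter space. This is a minimal sketch, assuming independent observations and a small illustrative range of integer values for $\mu$ (from $-200$ to $399$); the helper names `likelihood` and `best_supported` are hypothetical. 

```python
from fractions import Fraction

C = Fraction(1, 10_000)  # normalizing constant c in Birnbaum's definition


def f(x, mu, sigma):
    """Probability of observing the integer x given theta = (mu, sigma)."""
    if sigma == 0:
        return Fraction(1) if x == mu else Fraction(0)
    d = abs(x - mu)
    return C * (100 - d) if d < 100 else Fraction(0)


def likelihood(xs, mu, sigma):
    """Joint probability of independent observations xs given (mu, sigma)."""
    result = Fraction(1)
    for x in xs:
        result *= f(x, mu, sigma)
    return result


def best_supported(xs, mu_values):
    """Return (likelihood, mu, sigma) for the best supported grid point."""
    return max(
        ((likelihood(xs, mu, sigma), mu, sigma)
         for mu in mu_values for sigma in (0, 100)),
        key=lambda item: item[0],
    )


mu_values = range(-200, 400)  # assumed range, for illustration only
same = best_supported([17, 17], mu_values)
differ = best_supported([17, 132], mu_values)
ratio_same = likelihood([17, 17], 17, 0) / likelihood([17, 17], 17, 100)
print(same[1:], differ[1:], ratio_same)
```

Because $\mu$ is restricted to integers, the maximum for the dissimilar observations falls on the integers either side of $(x_1 + x_2)/2 = 74.5$, and the search reports the first of them. 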


Both modifications of Birnbaum’s example to allow the number of data points to equal 
the number of estimated parameters disarmed the counter-example by making the probability 
of misleading support for a false hypothesis very low or zero. Thus, to the extent that 
Birnbaum’s example is a counter-example, it is a counter-example to the likelihood principle 
only in the specific circumstance that the number of parameters exceeds the number of data 
points. Such reliance on overfitting may of itself be sufficient to dismiss the alleged 
counter-example because many statistical approaches are known to behave erratically where the 
number of estimated parameters exceeds the number of data points, and such misbehavior is 
not routinely assumed to be indicative of a counter-example to a statistical or philosophical 
principle. 


6 What happened had to happen 

The assumption that the likelihood principle allows the best supported hypothesis in any 
investigation to be a determinist hypothesis (a hypothesis that whatever happened had to 
happen) stands as an important criticism of the likelihood principle. Two distinct types of 
determinist hypotheses occur. The first kind corresponds to a (set of) parameter value(s) 
within a statistical model that provide for only a single specific outcome. The second kind 
of determinist hypothesis is that no matter what happened, it had to happen. Each kind 
is discussed in this section. 

A determinist hypothesis of the first kind is instantiated in Birnbaum’s analysis by the 
parameter value $\theta = (x, 0)$: no matter what single value of $x$ is observed, that parameter 
value has a higher likelihood than any other even if the true value is $\theta = (\mu, 100)$. It leads 
Birnbaum to write that “misleading interpretations will be suggested by the likelihood 
principle with probability unity”, but that is a true statement in only a very limited sense. 
If it is intended to mean that misleading interpretations will be suggested in every likelihood 
analysis of every investigation, it is false. Neither the non-overfitted variants of Birnbaum’s 
problem set out in section 5.3.2 nor the urn problems presented in sections 3.1 and 3.2 
contain parameter values that would be misleadingly supported with unity probability. 
Thus the statement that “misleading interpretations will be suggested by the likelihood 
principle with probability unity” is relevant specifically to the analysis and experiment 
that Birnbaum supplies. The presence of this type of determinist hypothesis in a statistical 
model is not universal and may be rare. 

The universality of the whatever happened had to happen hypothesis comes only with 
the second type of determinist hypothesis, one that Jaynes described as a ‘sure thing’ 
hypothesis wherein “every detail of the data was inevitable” (Jaynes 2003, p. 195). To 
obtain such a hypothesis requires invocation of Descartes’ evil demon, a device beloved 
by philosophers. The evil demon is a creature who is able to determine the value of 
observations and is determined to mislead the investigator, and Sober (1988, pp. 183-187) 
argues that the evil demon is a nuisance parameter in every likelihood analysis even 
though it is usually invoked only implicitly. For example, Mayo & Spanos (2011) criticise 
the likelihood principle using a very straightforward example where there is a determinist 
hypothesis, $H^*$, in the statistical model for a coin-tossing experiment. 


For an extreme case, $H^*$ can assert that the probability of heads is 1 just on those cases 
that yield heads, 0 otherwise. [...] So the fair coin hypothesis is always rejected in favor 
of $H^*$, even when the coin is fair. (Mayo & Spanos 2011, p. 184) 


There is no mention of an evil demon, but it would be an evil demon who sets the probability 
of heads for each toss of the coin in $H^*$. That evil demon provides a reason to mistrust 
the likelihood principle because everyone would agree that a preference for $H^*$ on the basis 
of any observed data would be silly. However, contrary to what might be implied by 
Mayo & Spanos, the likelihood principle does not entail such a preference because the fair 
coin hypothesis and $H^*$ lie on distinct likelihood functions because they exist in different 
parameter spaces. Not only that, but they cannot be contained in a single simple statistical 
model because their probability models are contradictory. To flesh that out, note that 
the statistical model universally used for a coin tossing experiment has a parameter space 
consisting of $p$, the fixed but unknown probability of heads for each trial, and $n$, the 
number of trials. The probability of any observable number of heads, $h$ where $0 \le h \le n$, 
is obtained from the binomial distribution, $\mathrm{Bin}(n, p)$, and so the likelihood function for $p$ 
when we observe $h$ heads is: 


$$
L(p) \propto p^{h} (1 - p)^{n-h} \tag{13}
$$

That model treats the value of $p$ as an unknown constant that is the same for each toss of 
the coin, and the fair coin hypothesis maps onto the hypothesis that $p = 1/2$. Contrast 
that with Mayo & Spanos’s clearly stated determinist hypothesis whereby $p$ varies from 
toss to toss. The varying $p \in \{0, 1\}$ of the determinist hypothesis is a different thing from 
the fixed but unknown constant $p \in \mathbb{R}$, $0 \le p \le 1$, of the set of hypotheses that include 
the fair coin hypothesis, and so it is not among the points of the likelihood function given 
in equation 13. As the likelihood principle offers no guidance as to how to compare the 
nonsensical $H^*$ with the fair coin hypothesis, the likelihood principle cannot be blamed for 
any misguided acceptance of $H^*$. Further, in this context, a lack of guidance should hardly 
be seen as a failing. 
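
The scope of the binomial likelihood function can be made concrete with a short computation. This is a minimal sketch using hypothetical data (7 heads in 10 tosses) and a hypothetical function name, `binomial_likelihood`; it evaluates only fixed values of $p$, which is all that the binomial model contains. 

```python
def binomial_likelihood(p, h, n):
    """Likelihood of a fixed heads probability p, up to a constant factor."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if not 0 <= h <= n:
        raise ValueError("h must lie between 0 and n")
    return p ** h * (1 - p) ** (n - h)


h, n = 7, 10          # hypothetical data: 7 heads in 10 tosses
p_hat = h / n         # best supported fixed value of p

# Support for the best fixed p over the fair coin hypothesis.
ratio = binomial_likelihood(p_hat, h, n) / binomial_likelihood(0.5, h, n)

# No fixed p makes a mixed sequence of heads and tails certain.
largest = max(binomial_likelihood(k / 100, h, n) for k in range(101))
print(ratio, largest)
```

For these data the best supported fixed value of $p$ is favored over $p = 1/2$ by a factor of a little more than 2, and no point on the function assigns the observed sequence a probability of 1. A toss-by-toss varying $p$ of the kind asserted by $H^*$ is not an argument that this function can accept. 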

Some will be tempted to say that the determinist hypothesis can be grafted onto the 
likelihood function given in equation 13 and therefore that hypothesis can serve to 
undermine the likelihood principle even where it is not explicitly included in the function. 
However, such a cobbled together model is inappropriate for two reasons. First, both 
the parameter spaces and the probability mechanisms of the two parts of the statistical 
model would be distinct and so a cobbled together union of two models remains two distinct 
models, one where the outcomes result from probabilities determined by the binomial 
distribution using the parameters $n$ and $p$, and the other where the outcomes are determined 
by the whim of an evil demon. In that case, some prior probabilities would be necessary to 
weight the two parts of the combined model and, as others have pointed out (Edwards 1992, 
Pawitan 2001, Royall 2004), one would inevitably lean towards a very low prior probability 
on the evil demon arm of the model. Second, given that the choice of statistical model is 
an arbitrary decision on the part of the analyst, why would anyone choose to include an evil 
demon hypothesis along with the more interesting hypotheses? It is no more sensible to 
consider the evil demon hypothesis than it would be to consider, for example, a probability 
model for the urn problems where the probability assigned to the drawing of green balls is 
7.9 times the proportion of balls in the urn that are green. Even though it could be done, 
it would not be sensible to do so. 

The whatever happened had to happen hypothesis has only a very limited capacity 
to undermine the likelihood principle. It is relevant to likelihood analyses where there is 
overfitting, as in Birnbaum’s analysis but not otherwise, and, even so, it is probably no more 
important to avoid overfitting in likelihood analyses than it is in other statistical analyses. 
The evil demon nuisance parameter is easily avoided by simply choosing a statistical model 
which does not include it, as all sensible models probably do. No example where an evil 
demon is invoked, explicitly or implicitly, can serve as a general counter-example to the 
likelihood principle. The notion that a likelihood analysis will always support a hypothesis 
that whatever happened had to happen is wrong. 

7 Appendix: The likelihood principle according to Birnbaum 


This appendix contains two versions of the likelihood principle given by Birnbaum, 
the first from his 1962 paper which contains his proof that the likelihood principle is entailed 
by the conditionality and sufficiency principles (Birnbaum 1962) and the second from his 
paper of 1969 which contains the alleged counter-example that is the subject of this paper 
(Birnbaum 1969, pp. 125-126). 


7.1 Birnbaum’s 1962 version 

The likelihood principle (L): If $E$ and $E'$ are any two experiments with the same 
parameter space, represented respectively by density functions $f(x, \theta)$ and $g(y, \theta)$; 
and if $x$ and $y$ are any respective outcomes determining the same likelihood function; 
then $\mathrm{Ev}(E, x) = \mathrm{Ev}(E', y)$. That is, the evidential meaning of any outcome $x$ of any 
experiment $E$ is fully characterized by giving the likelihood function $c f(x, \theta)$ (which 
need be described only up to an arbitrary positive constant factor), without reference 
to the structure of $E$. 


7.2 Birnbaum’s 1969 version 

Another general concept of statistical evidence, the likelihood concept (often called 
the likelihood principle), consists of two parts. One is an axiom which resembles the 
sufficiency axiom and indeed implies the latter. The second part specifies in more 
positive terms a mode of interpretation of statistical evidence, and thus resembles the 
confidence concept in some respects. 

To formulate these, we require the definition of the likelihood function (which 
plays important roles, technical and conceptual, in various areas of mathematical 
statistics): For each model $(E, j)$ of statistical evidence, the function $p_{ij}$ of $i$, $i \in \Omega$, 
is called the likelihood function. Here $j$ is fixed. More precisely, the function $p_{ij}$ 
is specified as one among many alternative, equivalent representations of the same 
likelihood function, all having the form $c p_{ij}$ where $c$ denotes an arbitrary positive 
number. 

We discuss the axiom first: 

(L): The likelihood axiom: If two models of statistical evidence $(E, j)$ and $(E', j')$ 
determine the same likelihood function, then they represent the same evidential 
meaning. That is, if for some positive $c$ we have $p_{ij} = c p'_{ij'}$ for each $i \in \Omega$ (the 
common parameter space of $E$ and $E'$), then $\mathrm{Ev}(E, j) = \mathrm{Ev}(E', j')$ (where 
$E = (p_{ij})$, $E' = (p'_{ij'})$). 

We note that when we take $E' = E$, this becomes just the sufficiency axiom. 

The second part of the likelihood concept complements the likelihood axiom, by 
indicating to some extent how a likelihood function may be interpreted as a 
representation of evidential meaning. It consists of the statement that the evidence supporting 
one parameter point $i$ against another $i'$ is represented just by the numerical value 
of the likelihood ratio $L(i, i') = p_{ij}/p_{i'j}$, with the value of unity marking neutral 
evidence and successively larger values indicating stronger support for $i$ against $i'$. The 
concept specifies no further structure or interpretation for the likelihood ratio scale, 
nor any specific concept of “evidence supporting a set of parameter points.” 